    Application of Query-Based Qualitative Descriptors in Conjunction with Protein Sequence Homology for Prediction of Residue Solvent Accessibility

    Characterizing relative solvent accessibility (RSA) plays a major role in classifying a given protein residue as surface-exposed or buried. This information is useful for studying protein structure and protein-protein interactions, and it is usually the first approach applied in the prediction of 3-dimensional (3D) protein structures. Various complicated and time-consuming methods, such as machine learning, have been applied to solvent-accessibility prediction. In this thesis, we present a simple application of linear regression that uses various sequence homology values for each residue together with query-residue qualitative predictors corresponding to each of the 20 amino acids. First, a fit is generated by applying linear regression to training sets with a variety of sequence homology parameters, including various sequence entropies, and residue qualitative predictors. The coefficients obtained from the training sets are then applied to the test set to obtain predicted RSA values. The qualitative predictors describe the actual query residue type (e.g., Gly), as opposed to the measures of sequence homology for the aligned subject residues. Prediction accuracies were calculated by comparing the predicted RSA values with NACCESS RSA values (derived from X-ray crystallography). The utilization of qualitative predictors yielded significant prediction accuracy
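
The feature construction described in this abstract can be illustrated with a minimal sketch. It is not the thesis code: the function names, the choice of Shannon entropy in bits, the handling of gap characters, and the coefficient layout (20 per-residue terms followed by one entropy slope) are all assumptions made for illustration, and any coefficients passed in would have to come from a real fit on a training set.

```python
import math
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_entropy(column):
    """Shannon entropy (in bits) of the residues aligned at one position.

    Characters outside the 20 standard residues (e.g. gaps) are ignored;
    that choice is an assumption of this sketch.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def one_hot(residue):
    """Qualitative predictor: a 20-element indicator for the query residue type."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def feature_row(residue, column):
    """Concatenate the qualitative predictor with one homology descriptor."""
    return one_hot(residue) + [column_entropy(column)]


def predict_rsa(coefficients, row):
    """Apply previously fitted linear-regression coefficients to one residue."""
    return sum(c * x for c, x in zip(coefficients, row))
```

Calling `feature_row("G", "GGAA")` yields a 21-element row: the indicator for Gly followed by the entropy of the aligned column. No intercept is included here because the 20 indicators already sum to one for every residue.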

    Logistic regression models to predict solvent accessible residues using sequence- and homology-based qualitative and quantitative descriptors applied to a domain-complete X-ray structure learning set

    A working example of relative solvent accessibility (RSA) prediction for proteins is presented. Novel logistic regression models have been built and validated using qualitative descriptors, including amino acid type, and quantitative descriptors, including 20- and six-term sequence entropy. A domain-complete learning set of over 1300 proteins is used to fit initial models with various sequence homology descriptors as well as query residue qualitative descriptors. Homology descriptors are derived from BLASTp sequence alignments, whereas the RSA values are determined directly from the crystal structure. The logistic regression models are fitted using dichotomous responses indicating whether a residue is buried or solvent accessible, with the binary classifications obtained from the RSA values. The fitted models produce binary predictions of residue solvent accessibility with accuracies comparable to other less computationally intensive methods, using the standard RSA thresholds of 20% and 25% to define a residue as solvent accessible. When an additional non-homology descriptor describing Lobanov–Galzitskaya residue disorder propensity is included, incremental improvements in accuracy are achieved, with 25% threshold accuracies of 76.12% and 74.45% for the Manesh-215 and CASP(8+9) test sets, respectively. Moreover, the described software and the accompanying learning and validation sets allow students and researchers to explore the utility of RSA prediction with simple, physically intuitive models in any number of related applications.
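
The dichotomous-response setup described in this abstract can be sketched as follows. This is an illustrative sketch, not the authors' software: the function names are hypothetical, treating RSA exactly at the threshold as accessible is an assumption, and plain batch gradient descent stands in for whatever fitting procedure the described software uses.

```python
import math


def label_accessible(rsa_percent, threshold=25.0):
    """Binary response: 1 = solvent accessible, 0 = buried.

    Treating RSA equal to the threshold as accessible is an assumption.
    """
    return 1 if rsa_percent >= threshold else 0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_score(weights, bias, row):
    return bias + sum(w * x for w, x in zip(weights, row))


def fit_logistic(rows, labels, learning_rate=0.5, steps=2000):
    """Fit logistic-regression weights by batch gradient descent on the mean log-loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, y in zip(rows, labels):
            error = sigmoid(linear_score(weights, bias, row)) - y
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_accessible(weights, bias, row):
    """Predict 1 (accessible) when the modelled probability is at least 0.5."""
    return 1 if sigmoid(linear_score(weights, bias, row)) >= 0.5 else 0


def accuracy(predicted, actual):
    """Fraction of residues whose predicted class matches the reference class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

In use, each row would hold the descriptors for one residue (for example amino acid indicators and sequence entropies), the labels would come from `label_accessible` applied to crystal-structure RSA values at the 20% or 25% threshold, and `accuracy` would be evaluated on a held-out test set.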